Abstract. Chopin (2007) introduced a sequentially ordered hidden Markov
model, for which states are ordered according to their order of appearance,
and claimed that such a model is a re-parametrisation of a standard Markov
model. This note gives a formal proof that this equivalence holds in Bayesian
terms, as both formulations generate equivalent posterior distributions, but
does not hold in Frequentist terms, as both formulations generate incompatible
likelihood functions. Perhaps surprisingly, this shows that Bayesian
re-parametrisation and Frequentist re-parametrisation are not identical
concepts.

A standard hidden Markov model assumes that the observed process $(y_t)$ follows
a mixture distribution:

(1)    $y_t \mid \{s_t = k\} \sim f(y_t \mid \xi_k), \qquad t = 0, 1, \ldots$

where $\{f(\cdot \mid \xi)\}$ is a parametric family of probability densities, and that the
unobserved assignment variable $(s_t)$ is a $K$-state Markov chain:

(2)    $p(s_{t+1} = l \mid s_t = k) = q_{kl}, \qquad k, l = 1, \ldots, K.$

If necessary, distribution (1) can also depend on $y_{t-1}, y_{t-2}, \ldots$ The transition
matrix is denoted by $Q = (q_{kl})$, and the vector of unknown parameters, which
includes both the $q_{kl}$'s and the $\xi_k$'s, is denoted by $\theta$. Lastly, it is often assumed
that

(3)    $p(s_0 = k) = \vartheta_k, \qquad k = 1, \ldots, K,$

where $\vartheta = (\vartheta_1, \ldots, \vartheta_K)$ is the stationary distribution of the chain, i.e. the
solution of $\vartheta Q = \vartheta$. See e.g. MacDonald and Zucchini (1997), McLachlan and Peel
(2000, Chap. 13), Scott (2002) or Cappé et al. (2005) for more background on
hidden Markov models and their numerous applications.
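
As an illustration of equations (1) to (3), the following minimal sketch computes the stationary distribution of a transition matrix and simulates a short trajectory. It is not taken from the paper: the Gaussian emission family with unit variance, the function names and the numerical values are all assumptions made for the example, and states are labelled from 0 rather than from 1.

```python
import numpy as np

def stationary_distribution(Q):
    """Solve v Q = v with sum(v) = 1 for a K x K transition matrix Q."""
    K = Q.shape[0]
    # Stack the balance equations (Q^T - I) v = 0 with the constraint sum(v) = 1.
    A = np.vstack([Q.T - np.eye(K), np.ones((1, K))])
    b = np.concatenate([np.zeros(K), [1.0]])
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Guard against tiny negative values caused by round-off.
    v = np.clip(v, 0.0, None)
    return v / v.sum()

def simulate_hmm(Q, means, T, rng):
    """Draw s_0..s_T and y_0..y_T; emissions are N(means[s_t], 1) (an assumption)."""
    K = Q.shape[0]
    s = np.empty(T + 1, dtype=int)
    s[0] = rng.choice(K, p=stationary_distribution(Q))  # equation (3)
    for t in range(T):
        s[t + 1] = rng.choice(K, p=Q[s[t]])              # equation (2)
    y = rng.normal(loc=means[s], scale=1.0)              # equation (1)
    return s, y

Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
v = stationary_distribution(Q)
s, y = simulate_hmm(Q, np.array([-1.0, 1.0]), T=50, rng=np.random.default_rng(0))
print(v)
```

For this particular matrix the stationary distribution is (2/3, 1/3), which can be checked by hand from the balance equations.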

The sequentially ordered hidden Markov model introduced by Chopin (2007) has
the same observation equation as (1), i.e.

(4)    $y_t \mid \{z_t = k\} \sim f(y_t \mid \bar\xi_k), \qquad t = 0, 1, \ldots$

but differs with respect to the behaviour of the hidden process, now denoted by
$(z_t, m_t)$:

$(z_0, m_0) = (1, 1),$

(5)    $p(z_{t+1} = l \mid z_t = k, m_t = m) = \begin{cases} \bar q_{kl} & \text{if } k, l \le m \le K, \\ \sum_{l'=m+1}^{K} \bar q_{kl'} & \text{if } k < l = m + 1 \le K, \\ 0 & \text{otherwise,} \end{cases}$

$m_{t+1} = \max(m_t, z_{t+1}),$

for $t = 0, 1, \ldots$ For the sake of clarity, parameters in this second formulation are
barred, e.g. $\bar\theta$, $\bar\xi_k$, $\bar q_{kl}$, etc. The quantity $m_t$ represents the number of states that
have appeared up to time $t$, and $z_t$ represents the current state, as labelled with
respect to the order of appearance of state values; e.g. $z_0 = 1$, as the first state to appear
is always labelled '1', then $z_1 = 1$ with probability $\bar q_{11}$, and $z_1 = 2$ otherwise, etc.
The pair $(z_t, m_t)$ is a $K'$-state Markov chain, with $K' = K(K + 1)/2$, since $z_t \le m_t$
with probability one.
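
The structure of the pair chain can be made concrete with a short sketch that builds its transition matrix from equation (5). The function name `ordered_chain` and the numerical matrix are hypothetical choices made for this illustration; state labels follow the 1-based convention of the text.

```python
import numpy as np

def ordered_chain(Q_bar):
    """Transition matrix of the pair (z_t, m_t) defined by equation (5).

    Q_bar[k-1, l-1] holds the barred transition probability from k to l.
    """
    K = Q_bar.shape[0]
    # Admissible pairs satisfy 1 <= z <= m <= K.
    states = [(z, m) for m in range(1, K + 1) for z in range(1, m + 1)]
    index = {zm: i for i, zm in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for (k, m), i in index.items():
        for l in range(1, m + 1):
            # Move to a state that has already appeared: m is unchanged.
            P[i, index[(l, m)]] = Q_bar[k - 1, l - 1]
        if m < K:
            # Move to a state not seen before, necessarily labelled m + 1.
            P[i, index[(m + 1, m + 1)]] = Q_bar[k - 1, m:].sum()
    return states, P

Q_bar = np.array([[0.70, 0.20, 0.10],
                  [0.30, 0.50, 0.20],
                  [0.25, 0.25, 0.50]])
states, P = ordered_chain(Q_bar)
print(len(states))    # K(K+1)/2 = 6 for K = 3
print(P.sum(axis=1))  # every row sums to one
```

With $K = 3$ the chain has six states, and from $(1, 1)$ it can only stay in $(1, 1)$ or move to $(2, 2)$, which matches the description of $z_1$ above.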

Chopin (2007) discusses several advantages of time-ordered hidden Markov models.
First, they are identifiable, provided $T > K$, whereas standard hidden Markov
models are invariant with respect to state re-labelling. Second, they are still hidden
Markov models, with hidden chain $(z_t, m_t)$, so standard algorithms for hidden
Markov models (such as Gibbs sampling) can be adapted with little extra effort.
Third, one may conveniently estimate $m_t$ in order to evaluate how many states are
required to model the data. Fourth, in sequential settings, that is, where statistical
inference is performed at each time $t$ where a new data-point is available, states are
automatically and consistently ordered at all iterations.

A last advantage of these models, which was not mentioned by Chopin (2007), is
that they are slightly more parsimonious, since they do not require the specification
of a distribution for the initial state $s_0$, as in (3).

The rest of the paper is organised as follows. Section 1 proves that the two
formulations generate equivalent posterior distributions. Section 2 shows that the two
formulations lead to likelihood functions that cannot be compared with each other.
Section 3 discusses this paradox, and mentions possible extensions of sequentially
ordered hidden Markov models.



1. Bayesian equivalence 

We assume that $(y_t)$ is the stochastic process defined by (1) and (2), i.e. a
standard hidden Markov model, where dependencies on parameter $\theta$ are interpreted
as conditional dependencies on random variable $\theta$, with prior probability density
$\pi$. We prove that, under the following two mild conditions, one may define a latent
process $(z_t, m_t)$ and a random parameter $\bar\theta$ distributed according to $\pi$, such that the
distribution of $(y_t)$ conditional on $\bar\theta$ corresponds to (4) and (5), i.e. a sequentially
ordered hidden Markov model.

Condition 1. The prior distribution $\pi$ is invariant with respect to state
re-labelling, that is, for any permutation $\tau$ of the first $K$ integers, if $\theta \sim \pi$, then

$\theta_\tau = (\xi_{\tau(1)}, \ldots, \xi_{\tau(K)}, q_{\tau(1)\tau(1)}, \ldots, q_{\tau(K)\tau(K-1)})$

is also distributed according to $\pi$.

Condition 2. Under the prior distribution $\pi$, all the components of matrix $Q$
are positive with probability one.

Let $\sigma_t$ be the random sequence of increasing size that records sequentially the
state values as they appear for the first time in $s_{1:t}$; e.g. if $s_{1:5} = (4, 3, 4, 7, 3)$ then
$\sigma_5 = (4, 3, 7)$. With an abuse of notation, $\sigma_t$ shall also stand for any permutation
$\tau$ of the first $K$ integers such that $\tau(i) = \sigma_t(i)$, for $1 \le i \le k$, where $k$ is the length
of sequence $\sigma_t$. In particular, let $z_t = \sigma_t^{-1}(s_t)$ and $m_t = \max_{t' \le t} z_{t'}$.
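
The relabelling just defined is easy to compute. The sketch below (the function name `order_of_appearance` is a hypothetical choice, not notation from the paper) returns the order-of-appearance sequence together with the relabelled states and the running number of distinct states, applied to the example sequence (4, 3, 4, 7, 3).

```python
def order_of_appearance(s):
    """Return (sigma, z, m) for a sequence of state values s."""
    sigma = []     # state values, in order of first appearance
    z, m = [], []
    for state in s:
        if state not in sigma:
            sigma.append(state)
        z.append(sigma.index(state) + 1)  # 1-based rank of first appearance
        m.append(len(sigma))              # number of distinct states seen so far
    return sigma, z, m

sigma, z, m = order_of_appearance([4, 3, 4, 7, 3])
print(sigma)  # [4, 3, 7]
print(z)      # [1, 2, 1, 3, 2]
print(m)      # [1, 2, 2, 3, 3]
```

By construction each entry of `m` is the running maximum of `z`, and the first relabelled state is always 1.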

According to Condition 2 and basic properties of Markov chains, there exists
almost surely a vector $\sigma$ of size $K$ and a finite time $t^*$ such that $\sigma_t = \sigma$ for all $t \ge t^*$.
Let $\bar\theta = \sigma(\theta)$ and $\bar Q = \sigma(Q)$, which are obtained respectively by re-ordering the
components of $\theta$ and $Q$ according to $\sigma$; e.g. $\bar Q = (q_{\sigma(i)\sigma(j)})$. Finally, denote by $\sigma_{a:b}$
any sub-vector $(\sigma(a), \ldots, \sigma(b))$ of $\sigma$, where $a, b$ are positive integers.

We show now that $(z_t, m_t)$ verify (5). Clearly, $(z_0, m_0) = (1, 1)$ with probability
1. Then, compare

$p(\sigma_{2:K} = \tau \mid z_1 = 2, s_0 = k, \theta) = p(\sigma_{2:K} = \tau \mid s_1 \neq k, s_0 = k, \theta)$

with

$p(\sigma_{2:K} = \tau \mid z_1 = 1, s_0 = k, \theta) = p(\sigma_{2:K} = \tau \mid s_0 = s_1 = k, \theta),$

where $\tau$ is a vector of size $K - 1$, and '1' and '2' are the only two possible values
for $z_1$. Since $(s_t)$ is Markov, conditional on $s_0$ the order of appearance $\sigma_{2:K}$ of
the states that differ from $s_0$ does not depend on the (random) date $\eta$ where $s_t$
changes value for the first time; i.e. $s_0 = s_1 = \ldots = s_{\eta-1} \neq s_\eta$. In particular, both
probabilities above equal

$p(\sigma_{2:K} = \tau \mid s_0 = k, \theta)$

and one deduces that, conditional on $\sigma(1)$ (which equals $s_0$) and $\theta$, $\sigma_{2:K}$ and $z_1$ are
independent random variables. Thus

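$p(z_1 = 1 \mid \sigma, \theta) = p(s_1 = s_0 \mid s_0, \theta) = q_{s_0 s_0} = \bar q_{11}$

almost surely. Since $\bar\theta$ is a deterministic function of $\sigma$ and $\theta$, and $\bar q_{11}$ is a
deterministic function of $\bar\theta$, one has:

$p(z_1 = 1 \mid \bar\theta) = \bar q_{11},$

and,

$p(z_1 = 2 \mid \bar\theta) = 1 - p(z_1 = 1 \mid \bar\theta) = \sum_{l=2}^{K} \bar q_{1l}.$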

This reasoning can be generalised to further time steps: for $k, l \le m$, where $m$
is the length of integer sequence $\sigma_t$, $\sigma_{m+1:K}$ and $z_{t+1}$ are independent variables,
conditional on $z_t = k$, $m_t = m$, and $\sigma_{1:m}$. Thus, for $k, l \le m$, and for some
arbitrary permutation $\tau$,

(6)    $p(z_{t+1} = l \mid z_t = k, m_t = m, \sigma = \tau, \theta)$

       $= p(s_{t+1} = \tau(l) \mid s_t = \tau(k), m_t = m, \sigma = \tau, \theta)$

       $= p(s_{t+1} = \tau(l) \mid s_t = \tau(k), m_t = m, \sigma_{1:m} = \tau_{1:m}, \theta)$

       $= q_{\tau(k)\tau(l)}$

which gives:

$p(z_{t+1} = l \mid z_t = k, m_t = m, \sigma, \theta) = q_{\sigma(k)\sigma(l)} = \bar q_{kl}$

with probability one, and, since probabilities sum up to one,

(7)    $p(z_{t+1} = m + 1 \mid z_t = k, m_t = m, \sigma, \theta) = \sum_{l=m+1}^{K} \bar q_{kl}.$

One can replace $(\sigma, \theta)$ with $\bar\theta$ in the conditioning of (6) and (7), since the right-hand
sides depend only on $\bar\theta$, a deterministic function of $(\sigma, \theta)$. This gives the desired
result.



2. Frequentist equivalence

The two formulations define two incompatible data generating processes. For
instance, under the first model, the marginal distribution of $y_0$ is a mixture:

$p(y_0 \mid \theta) = \sum_{k=1}^{K} \vartheta_k f(y_0 \mid \xi_k),$

whereas, under the second model, this distribution is $f(y_0 \mid \bar\xi_1)$. However, this
distinction seems relevant only when one would observe repeated samples $y_{1:T}^{(i)}$ for the
same parameter value $\theta$, which seldom happens in practice. More importantly, the
two formulations define non-equivalent likelihood functions, in the sense that, for
$t \in \Theta$, the likelihood of the first model $p(y_{1:T} \mid \theta = t)$ is generally not equal to the
likelihood of the second model $p(y_{1:T} \mid \bar\theta = t)$, if only because the former likelihood
is invariant with respect to state relabelling, whereas the latter is not.
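
The relabelling argument can be checked numerically. The sketch below is an illustration under stated assumptions rather than the author's code: emissions are taken to be Gaussian with unit variance, the data vector and parameter values are arbitrary, and all function names are hypothetical. It evaluates both marginal likelihoods with the forward recursion, at a parameter value and at its relabelled version.

```python
import numpy as np

def forward_likelihood(init, P, dens):
    """Marginal likelihood; dens[t, i] is the density of y_t in hidden state i."""
    alpha = init * dens[0]
    for t in range(1, dens.shape[0]):
        alpha = (alpha @ P) * dens[t]
    return alpha.sum()

def gauss(y, means):
    """N(mean, 1) densities (an assumed emission family), shape (len(y), K)."""
    return np.exp(-0.5 * (y[:, None] - means[None, :]) ** 2) / np.sqrt(2 * np.pi)

def standard_likelihood(Q, means, y):
    """Standard model: initial state drawn from the stationary distribution (3)."""
    K = Q.shape[0]
    A = np.vstack([Q.T - np.eye(K), np.ones((1, K))])
    v = np.linalg.lstsq(A, np.r_[np.zeros(K), 1.0], rcond=None)[0]
    return forward_likelihood(v, Q, gauss(y, means))

def ordered_likelihood(Q_bar, means_bar, y):
    """Sequentially ordered model: hidden chain (z_t, m_t) of equation (5)."""
    K = Q_bar.shape[0]
    states = [(z, m) for m in range(1, K + 1) for z in range(1, m + 1)]
    idx = {zm: i for i, zm in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for (k, m), i in idx.items():
        for l in range(1, m + 1):
            P[i, idx[(l, m)]] = Q_bar[k - 1, l - 1]
        if m < K:
            P[i, idx[(m + 1, m + 1)]] = Q_bar[k - 1, m:].sum()
    init = np.zeros(len(states))
    init[idx[(1, 1)]] = 1.0                      # (z_0, m_0) = (1, 1)
    z_of = np.array([z - 1 for z, _ in states])  # emission depends on z only
    return forward_likelihood(init, P, gauss(y, means_bar)[:, z_of])

y = np.array([0.3, -1.2, 2.1, 1.7, -0.4])
Q = np.array([[0.9, 0.1],
              [0.3, 0.7]])
means = np.array([-1.0, 2.0])
perm = [1, 0]                                    # swap the two labels
Q_perm, means_perm = Q[np.ix_(perm, perm)], means[perm]

std = (standard_likelihood(Q, means, y), standard_likelihood(Q_perm, means_perm, y))
ordr = (ordered_likelihood(Q, means, y), ordered_likelihood(Q_perm, means_perm, y))
print(std)   # the two values coincide
print(ordr)  # the two values differ
```

The standard likelihood returns the same value for both labellings, whereas the ordered likelihood does not, since the first observation is tied to the state labelled '1'.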

In fact, our time-ordered model corresponds to a re-parametrisation of the full
vector of unknowns, that is, $(\theta, s_{1:T})$ transformed into $(\bar\theta, z_{1:T})$, and therefore the
complete likelihood $p(y_{1:T}, z_{1:T} \mid \bar\theta)$ is a re-parametrised version of $p(y_{1:T}, s_{1:T} \mid \theta)$, but
this property does not extend to the marginal likelihood function $p(y_{1:T} \mid \theta)$. A
practical consequence is that maximum likelihood (or similar) estimators obtained
from either formulation cannot be compared easily; thus one should refrain in
principle from applying Frequentist estimation procedures to sequentially ordered
hidden Markov models.

3. Discussion and extensions 

This example shows that Bayesian re-parametrisation is a more powerful concept
than Frequentist re-parametrisation. Since Bayesian analysis treats equally all
unknown quantities as random variables, whether they are 'parameters' or 'latent
variables', one can re-parametrise (apply a one-to-one transform to) the full vector
of unknowns, i.e. $(\theta, z)$ above, rather than re-parametrise only the parameter vector
$\theta$. This explains why both formulations are equivalent in terms of posterior
distributions, but not in terms of (marginal) likelihood functions. The author believes
that this is yet another example of the greater internal consistency of the Bayesian
approach.

One can easily derive sequentially ordered formulations for models closely related
to hidden Markov models, e.g. hidden semi-Markov models, or continuous-time
jump Markov models. The proof of the validity of these sequentially ordered
formulations is a simple extension of Section 1. For instance, a hidden Markov model
is a semi-Markov model where times between changes are geometrically distributed,
but since our proof is not based on this particular assumption, it extends readily
to semi-Markov models, up to cosmetic changes in the notation. In the same way,
a continuous-time jump Markov process is a process that stays in a given state $k$
for a random, exponentially distributed duration, and then switches to another state
according to some probability transition matrix with zeros on its main diagonal.